Data and file management in practice

Robert Turner, University of Sheffield RSE Team, September 2021

Data and code management in practice

Acknowledgements

Heavily based on Reproducible Research Data and Project Management in R by Anna Krystalli, naming things by Jenny Bryan and Methods in Research Software Engineering by David Wilby.

About me

Bob Turner

Mix of software engineering and research experience.

RSE Team

RSE

13 RSEs, 35 projects / year worth ~£11m total

Motivation

In this session…

Practical advice on:

  • Data file management
  • File naming
  • “Project” folders
  • Metadata

About you

What operating system(s) do you use?

What programming language(s) do you use?

Data Management

Data Management Plan

  • Start early. Make an RDM plan before collecting data.
  • Anticipate data products as part of your thesis outputs.
  • Think about what technologies to use.

Own your data

Take initiative & responsibility. Think long term.

Spreadsheets?

Do you agree?

Excel

But good for data viewing / entry, sometimes, perhaps…

Databases

Have a look at the Data Carpentry SQL for Ecology lesson

Data formats

  • .csv: comma separated values.
  • .tsv: tab separated values.
  • .txt: no formatting specified.

More unusual formats will need instructions on how to use them.
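As a rough illustration of how little separates .csv from .tsv, here is a minimal Python sketch using only the standard library. The table contents and the helper name to_delimited are made up for this example.

```python
import csv
import io

# Hypothetical table, invented for illustration.
rows = [
    ["site", "species", "count"],
    ["A", "Parus major", 4],
    ["B", "Erithacus rubecula", 7],
]


def to_delimited(rows, delimiter):
    """Serialise rows as delimited text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(rows, ",")   # comma separated values
tsv_text = to_delimited(rows, "\t")  # tab separated values

# Reading back gives the same table whichever delimiter was used.
# Note that every value comes back as a string.
from_csv = list(csv.reader(io.StringIO(csv_text)))
from_tsv = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(from_csv == from_tsv)  # True
```

Both are plain text, so any language or text editor can open them, which is the point of preferring them.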

Ensure data is machine readable

Andrea De Santis, unsplash.com

bad

bad

good

ok

  • could help data entry
  • .csv or .tsv copy would need to be saved.

Basic quality control

Use good null values, missing values are a fact of life:

  • Usually, the best solution is to leave the cell blank
  • NA or NULL are also good options
  • NEVER use 0. Avoid numbers like -999
  • Don’t make up your own code for missing values
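To see why a numeric code such as -999 is risky, here is a small Python sketch using only the standard library. The readings, the column name temp_c and the set of accepted missing-value markers are all assumptions made for this example.

```python
import csv
import io
from statistics import mean

# Hypothetical temperature readings; the reading for site B is missing.
BLANK_CODED = "site,temp_c\nA,12.5\nB,\nC,14.5\n"
SENTINEL_CODED = "site,temp_c\nA,12.5\nB,-999\nC,14.5\n"

MISSING = {"", "NA", "NULL"}  # assumed set of accepted missing-value markers


def read_temps(text):
    """Return the temp_c column as floats, with None for missing values."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        None if row["temp_c"].strip() in MISSING else float(row["temp_c"])
        for row in rows
    ]


def mean_ignoring_missing(values):
    """Average only the values that were actually recorded."""
    return mean(v for v in values if v is not None)


blank = read_temps(BLANK_CODED)
sentinel = read_temps(SENTINEL_CODED)

print(mean_ignoring_missing(blank))     # 13.5
print(mean_ignoring_missing(sentinel))  # -324.0, silently wrong
```

The blank cell is recognised as missing and skipped, while -999 is a perfectly valid number as far as the computer is concerned, so it ends up in the average.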

Data security

Raw data are sacrosanct

Give yourself less rope

Photo by Jon Moore, unsplash.com

  • It’s a good idea to revoke your own write permission to the raw data file. Then you can’t accidentally edit it.
  • It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
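One possible way to revoke your own write permission from a script is sketched below in Python. The file name is hypothetical and the demo runs on a throwaway temporary file. On Windows, os.chmod can only toggle the read-only flag, so treat this as a Linux / macOS oriented sketch.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_read_only(path):
    """Remove the write permission bits for owner, group and others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


# Demonstrate on a throwaway file standing in for a raw data file.
demo_dir = Path(tempfile.mkdtemp())
raw_file = demo_dir / "2021-09-01_raw-survey-data.csv"  # hypothetical name
raw_file.write_text("id,value\n1,3.2\n")

make_read_only(raw_file)

# Show the resulting permission bits; no write bit should remain.
print(oct(stat.S_IMODE(raw_file.stat().st_mode)))
```

The shell equivalent on Linux / macOS is chmod a-w followed by the file name.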

Know your main copies

Photo: Pexels CC0

  • identify the main copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising

How to avoid catastrophes

Backup: on disk
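A minimal sketch of an on-disk backup that also checks the copy is identical to the main copy, using only the Python standard library. The function names, file name and directories are hypothetical, and temporary directories stand in for real disks.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(main_copy, backup_dir):
    """Copy main_copy into backup_dir and confirm the copy is identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = Path(shutil.copy2(main_copy, backup_dir))
    if sha256_of(backup) != sha256_of(main_copy):
        raise IOError(f"Backup of {main_copy} does not match the original")
    return backup


# Throwaway directory standing in for a working disk and a backup disk.
work = Path(tempfile.mkdtemp())
main_copy = work / "2021-09-01_raw-survey-data.csv"  # hypothetical name
main_copy.write_text("id,value\n1,3.2\n")

backup = backup_and_verify(main_copy, work / "backup")
print(backup.name)
```

Keeping the checksum alongside the backup also lets you check later that neither copy has been changed or corrupted.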

Backup: in the cloud

  • Dropbox, Google Drive, etc.
  • if installed on your system, you can access them programmatically through R
  • some version control

Backup: the Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps (e.g. Google Drive, GitHub)

Backup: GitHub

  • most solid version control.
  • keep everything in one project folder.
  • Can be problematic with really large files.

Good File Naming

Let’s face it…

  • There are going to be files
  • LOTS of files
  • The files will change over time
  • The files will have relationships to each other

It’ll probably get complicated

File organization and naming is a mighty weapon against chaos

  • Make a file’s name and location VERY INFORMATIVE about:
    • what it is,
    • why it exists,
    • how it relates to other things
  • The more things are self-explanatory, the better.

What works, what doesn’t?

NO

myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt

YES

2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt

Question

What makes a good file name?

Three principles for good (file) names

  1. Machine readable
  2. Human readable
  3. Play well with default ordering

Machine readable

  • Regular expression and globbing friendly
    • Avoid spaces, punctuation, accented characters, case sensitivity
  • Easy to compute on
    • Deliberate use of delimiters

Filtering and search through Globbing

In the following:

ls -lh *Plasmid*

*Plasmid* is a glob.
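The same kind of filtering can be sketched in Python with the standard fnmatch module. The file names below are invented for illustration. To match against a real directory, pathlib.Path.glob accepts the same kind of pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical file listing, invented for illustration.
files = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A02.csv",
    "2013-06-26_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_D08.csv",
    "2014-02-26_BRAFWTNEGASSAY_FFPEDNA-CRC-REPEAT_H03.csv",
]

# Same pattern as `ls -lh *Plasmid*`: keep names containing "Plasmid".
plasmid_files = [name for name in files if fnmatchcase(name, "*Plasmid*")]

# Patterns can combine several pieces of the name, here date and extension.
june_2013 = [name for name in files if fnmatchcase(name, "2013-06-26_*.csv")]

print(len(plasmid_files), len(june_2013))  # 2 3
```

This only works because the names are consistent. One stray space or a differently spelled "plasmid" and the file drops out of the listing.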

Excerpt of complete file listing

Example of globbing to filter file listing

Search using macOS Finder

Delimit information with punctuation

Deliberate use of "-" and "_" allows recovery of metadata from the filenames:

  • "_" underscore used to delimit units of metadata I want to access later
  • "-" hyphen used to delimit words so our eyes don’t bleed

Splitting filenames by delimiters

This happens to be R but also possible in the shell, Python, etc.
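For comparison, a minimal Python sketch of the same idea. The file name is hypothetical and assumed to follow the "_" and "-" convention described above, with four units of metadata.

```python
from pathlib import Path

# Hypothetical file name following the "_" and "-" convention.
name = "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv"

# Drop the extension, then split on "_" to get the units of metadata.
date, assay, sample, well = Path(name).stem.split("_")

# Within a unit, "-" separates words.
sample_words = sample.split("-")

print(date, assay, well)  # 2013-06-26 BRAFWTNEGASSAY A01
print(sample_words)       # ['Plasmid', 'Cellline', '100', '1MutantFraction']
```

Note that the hyphens inside the date survive the split on "_", which is why the two delimiters need to be kept for separate jobs.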

Include important metadata

For example, I’m saving several files of temperature data extracted at different resolutions (res) and for different months (month). Including these parameters in the file name lets me use them to target the files to read in.

file_name <- paste0(paste("variable", res, month, sep = "_"), ".csv")  # build the name once, with an extension
write.csv(df, file_name, row.names = FALSE)
df <- read.csv(file_name)
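A rough Python equivalent, assuming hypothetical resolution labels and a made-up helper called temperature_file. It only builds and parses names, so no files are written.

```python
from itertools import product

resolutions = ["res05", "res10"]  # hypothetical resolution labels
months = ["01", "02", "03"]       # zero-padded month numbers


def temperature_file(res, month):
    """Build a file name that encodes the resolution and month."""
    return f"temperature_{res}_{month}.csv"


# One file name per combination of resolution and month.
all_files = [temperature_file(r, m) for r, m in product(resolutions, months)]

# Because the parameters are in the name, targeting the file for one
# resolution and month is just a matter of rebuilding that name.
target = temperature_file("res10", "02")
print(target in all_files)  # True

# The same parameters can be recovered from any name by splitting.
variable, res, month = target.rsplit(".", 1)[0].split("_")
print(variable, res, month)  # temperature res10 02
```

This relies on the parameter values themselves never containing "_", otherwise the split returns too many pieces.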

Recap: machine readable

  • Easy to search for files later
  • Easy to filter file lists based on names
  • Easy to extract info from file names, e.g. by splitting

New to regular expressions and globbing? Be kind to yourself and avoid:

  • Spaces in file names
  • Punctuation
  • Accented characters

Human readable

  • Name contains info on content

Example: Which set of file(name)s do you want at 3 a.m. before a deadline?

Embrace the slug

Recap: Human readable

  • Easy to figure out what the heck something is, based on its name

Play well with default ordering

  • Put something numeric first
  • Use the ISO 8601 standard for dates
  • Left pad other numbers with zeros

Examples: Chronological order and Logical order

Chronological order: Order by date / time

Dates

Use the ISO 8601 standard for dates: YYYY-MM-DD
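A short Python sketch of why YYYY-MM-DD helps: sorting the names alphabetically also sorts them chronologically. The dates and the _report.csv suffix are invented for illustration.

```python
from datetime import date

# Hypothetical report dates, deliberately out of order.
dates = [date(2021, 9, 14), date(2020, 12, 1), date(2021, 1, 30)]

# ISO 8601 (YYYY-MM-DD) versus day-first (DD-MM-YYYY) file names.
iso_names = [f"{d.isoformat()}_report.csv" for d in dates]
day_first_names = [f"{d.strftime('%d-%m-%Y')}_report.csv" for d in dates]

print(sorted(iso_names))
# ['2020-12-01_report.csv', '2021-01-30_report.csv', '2021-09-14_report.csv']

print(sorted(day_first_names))
# ['01-12-2020_report.csv', '14-09-2021_report.csv', '30-01-2021_report.csv']
```

With day-first names, September 2021 sorts before January 2021, which is exactly what a file browser will show you.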

Logical order: Put something numeric first

Left pad other numbers with zeros

If you don’t left pad, you get this:

10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R

which is just sad :(
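The fix is easy to script. A minimal Python sketch, assuming two digits are enough for the number of steps:

```python
# Step numbers and slugs taken from the listing above.
steps = {1: "data-cleaning", 2: "fit-model", 10: "final-figs-for-publication"}

unpadded = [f"{n}_{slug}.R" for n, slug in steps.items()]
padded = [f"{n:02d}_{slug}.R" for n, slug in steps.items()]  # 02d pads to width 2

print(sorted(unpadded))
# ['10_final-figs-for-publication.R', '1_data-cleaning.R', '2_fit-model.R']

print(sorted(padded))
# ['01_data-cleaning.R', '02_fit-model.R', '10_final-figs-for-publication.R']
```

If you might end up with more than 99 files, pad to three digits from the start.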

Recap: Play well with default ordering

  • Put something numeric first
  • Use the ISO 8601 standard for dates
  • Left pad other numbers with zeros

Recap: Three principles for (file) names

  1. Machine readable
  2. Human readable
  3. Play well with default ordering

Go forth and use awesome file names :)

“Projects”

Where shall I put my data?

File systems

  • Linux / macOS - home folder
  • Windows - documents folder

A project folder

myproject/
|
├── 01_data/
|   ├── 01_raw/
|   ├── 02_working/
|   └── 03_clean/
|
├── 02_scripts/
|
├── 03_figures/
|
├── 04_paper/
|
├── 05_presentation/
|
├── readme.md
|
└── license.md
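If you set up projects often, a skeleton like the one above can be created with a few lines of Python. This is only a sketch. The function name create_project is made up, and the demo builds the project in a temporary directory rather than in your home folder.

```python
import tempfile
from pathlib import Path

# Folder and file names copied from the layout above.
FOLDERS = [
    "01_data/01_raw",
    "01_data/02_working",
    "01_data/03_clean",
    "02_scripts",
    "03_figures",
    "04_paper",
    "05_presentation",
]
FILES = ["readme.md", "license.md"]


def create_project(root):
    """Create the folder skeleton under root; existing items are left alone."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


# Built in a temporary directory here; pass your own path in practice.
project = create_project(Path(tempfile.mkdtemp()) / "myproject")
print(sorted(p.name for p in project.iterdir()))
```

Because the folders are numbered and zero padded, they list in the same order as in the diagram.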

R (rrtools)

analysis/
|
├── paper/
│   ├── paper.Rmd       # this is the main document to edit
│   └── references.bib  # this contains the reference list information
│
├── figures/            # location of the figures produced by the Rmd
|
├── data/
│   ├── raw_data/       # data obtained from elsewhere
│   └── derived_data/   # data generated during the analysis
|
└── templates
    ├── journal-of-archaeological-science.csl
    |                   # this sets the style of citations & reference list
    ├── template.docx   # used to style the output of the paper.Rmd
    └── template.Rmd

Dependency Management

Good to include:

  • Python: requirements.txt, environment.yml, etc.
  • MATLAB: .prj file (XML)
  • R: renv.lock (use the renv package)

Don’t write your own dependency management.

Follow conventions…

  • …of your programming language.
  • …of your research area.

Signposting

Summary